Rafiki: A Middleware for Parameter Tuning of NoSQL Datastores for Dynamic Metagenomics Workloads
Not recorded
Abstract
High-performance computing (HPC) applications, such as metagenomics and other big data systems, need to store and analyze huge volumes of semi-structured data. Such applications often rely on NoSQL-based datastores, and optimizing these databases is a challenging endeavor, with over 50 configuration parameters in Cassandra alone. As the application executes, database workloads can change rapidly from read-heavy to write-heavy ones, and a system tuned with a read-optimized configuration becomes suboptimal when the workload becomes write-heavy. In this paper, we present a method and a system for optimizing NoSQL configurations for Cassandra and ScyllaDB when running HPC and metagenomics workloads. First, we identify the significance of configuration parameters using ANOVA. Next, we apply neural networks using the most significant parameters and their workload-dependent mapping to predict database throughput, as a surrogate model. Then, we optimize the configuration using genetic algorithms on the surrogate to maximize the workload-dependent performance. Using the proposed methodology in our system (Rafiki), we can predict the throughput for unseen workloads and configuration values with an error of 7.5% for Cassandra and 6.9-7.8% for ScyllaDB. Searching the configuration spaces using the trained surrogate models, we achieve performance improvements of 41% for Cassandra and 9% for ScyllaDB over the default configuration with respect to a read-heavy workload, and also significant improvement for mixed workloads. In terms of searching speed, Rafiki, using only 1/1000th of the searching time of exhaustive search, reaches within 15% and 9.5% of the theoretically best achievable performance for Cassandra and ScyllaDB, respectively, which supports optimization for highly dynamic workloads.
ACM Reference format: Elided. 2016. Rafiki: A Middleware for Parameter Tuning of NoSQL Datastores for Dynamic Metagenomics Workloads. In Proceedings of ACM Conference, Washington, DC, USA, July 2017 (Conference’17), 13 pages. DOI: 10.1145/nnnnnnn.nnnnnnn
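The search step described in the abstract (a genetic algorithm run against a surrogate throughput model) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the parameter names and ranges in `SPACE` are hypothetical placeholders rather than real Cassandra or ScyllaDB settings, and `surrogate` is a toy quadratic function standing in for the trained neural network, not the paper's model. The genetic algorithm itself is a plain elitist variant and may differ from the one used in Rafiki.

```python
import random

# Hypothetical configuration space: name -> (low, high) integer range.
# These names and ranges are placeholders, not real datastore parameters.
SPACE = {
    "param_a": (1, 64),
    "param_b": (1, 64),
    "param_c": (0, 100),
}


def surrogate(config, read_ratio):
    """Toy stand-in for a trained throughput predictor (higher is better)."""
    # The best value of each parameter shifts with the workload's read ratio,
    # mimicking a workload-dependent optimum.
    target_a = 8 + 48 * read_ratio
    target_b = 56 - 48 * read_ratio
    target_c = 50
    return -((config["param_a"] - target_a) ** 2
             + (config["param_b"] - target_b) ** 2
             + (config["param_c"] - target_c) ** 2)


def random_config(rng):
    """Draw one configuration uniformly from SPACE."""
    return {k: rng.randint(lo, hi) for k, (lo, hi) in SPACE.items()}


def crossover(a, b, rng):
    """Uniform crossover: each parameter comes from one of the two parents."""
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in SPACE}


def mutate(config, rng, rate=0.2):
    """Resample each parameter with probability `rate`."""
    out = dict(config)
    for k, (lo, hi) in SPACE.items():
        if rng.random() < rate:
            out[k] = rng.randint(lo, hi)
    return out


def optimize(read_ratio, generations=40, pop_size=30, elite=5, seed=0):
    """Elitist genetic search for the config the surrogate scores highest."""
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda c: surrogate(c, read_ratio), reverse=True)
        parents = population[:elite]  # elites survive unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - elite)
        ]
        population = parents + children
    return max(population, key=lambda c: surrogate(c, read_ratio))


if __name__ == "__main__":
    for ratio in (0.1, 0.9):  # write-heavy vs. read-heavy workload
        print(ratio, optimize(ratio))
```

Because each surrogate evaluation is a cheap function call rather than a benchmark run against a live cluster, a search of this kind can be rerun whenever the observed read ratio changes, which is the property the abstract relies on for dynamic workloads.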
Similar resources
Benchmarking Replication in Cassandra and MongoDB NoSQL Datastores
The proliferation in Web 2.0 applications has increased the volume, velocity, and variety of data sources which have exceeded the limitations and expected use cases of traditional relational DBMSs. Cloud serving NoSQL data stores address these concerns and provide replication mechanisms to ensure fault tolerance, high availability, and improved scalability. In this paper, we empirically explore...
A Comparison of Data Models and APIs of NoSQL Datastores
NoSQL datastore systems are a new generation of non-relational databases. More than fifty NoSQL systems have been already implemented, each with different characteristics — especially, with different data models and different APIs to access the data. In this paper we describe and compare the data models and operations offered by a number of representative NoSQL datastores, which we have directl...
RangeMerge: Online Performance Tradeoffs in NoSQL Datastores
Datastores are distributed systems that manage enormous amounts of structured data for online serving and batch processing applications. The NoSQL datastores weaken the traditional relational and transactional model in favor of horizontal scalability. They usually support concurrent operations with demanding throughput and latency requirements which may vary across different workload types. A t...
A Simple Approach for Executing SQL on a NoSQL Datastore
NoSQL datastores have been initially introduced to support a few concrete extreme scale applications. Limited query and indexing capabilities were therefore not a major impediment, as the specificity and scale of the target application justified the investment in manually crafting application code. With a number of alternatives now available and mature, there is an increasing willingness to use...
Access control in ultra-large-scale systems using a data-centric middleware
The primary characteristic of an Ultra-Large-Scale (ULS) system is ultra-large size on any related dimension. A ULS system is generally considered a system-of-systems with heterogeneous nodes and autonomous domains. As the size of a system-of-systems grows, and the demand for interoperability between sub-systems increases, achieving a more scalable and dynamic access control system becomes an im...